Evaluating static models on RTEB
The group of researchers associated with the Massive Text Embedding Benchmark (MTEB) has released a new benchmark: the Retrieval Text Embedding Benchmark (RTEB). As you may know, MTEB ranks models on their ability to perform well at a variety of tasks in a zero-shot setting, and is meant to reflect how well your model transfers to new tasks. Ranking high on MTEB can make or break your model, so it has become something that people optimize for, and, as Goodhart put it: “when a measure becomes a target, it ceases to be a good measure”.
Comparing PCA and MRL for static models
Without dimensionality reduction, static models can be hundreds of megabytes in size. Choosing the right dimensionality-reduction technique can shrink them without sacrificing retrieval quality. I was always a huge fan of Principal Component Analysis (PCA) for making static models smaller: PCA is used in model2vec, was used in an older version of tokenlearn to post-process models, and is used in the newer version of tokenlearn to reduce the dimensionality of the teacher models.(1) Recently, however, I started experimenting with Matryoshka Representation Learning (MRL) for reducing dimensions, and found it to be superior, which surprised me. This blog post thus tries to answer the question: when should you use PCA, and when MRL? If one is better than the other, why? I discuss both techniques, why applying dimensionality reduction to static embeddings makes sense, and some options for future work.
(1) I am no longer a maintainer or owner of these projects, but added the functionality while I was still at Minish.
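To make the PCA route concrete, here is a minimal sketch of shrinking a static embedding matrix with scikit-learn. The matrix is random and the sizes are illustrative, not taken from the post:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for a static model's embedding table: vocab_size x dim.
embeddings = rng.normal(size=(30_000, 1024)).astype(np.float32)

pca = PCA(n_components=256)
reduced = pca.fit_transform(embeddings)  # shape: (30_000, 256)

# For a static model, the embedding table essentially *is* the model,
# so storage shrinks roughly in proportion to the dimensionality
# (1024 -> 256 here, i.e. about 4x smaller).
print(reduced.shape, pca.explained_variance_ratio_.sum())
```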
Static late interaction models
Late interaction is an interesting paradigm for computing the similarity between two documents, and can be seen as a hybrid of sparse and dense retrieval. In this post, I will show how static models in a late interaction setting actually reduce to sparse models. I will also argue that, in the absence of empirical evidence to the contrary, there’s no good reason to assume that static late interaction models will be much better than their dense counterparts. But first, let’s dive into some fundamentals: I’ll explain what sparse retrieval and dense retrieval are, and how late interaction fits in with both paradigms.
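For reference, here is a minimal sketch of the late interaction (MaxSim) score used in ColBERT-style models, assuming pre-computed, L2-normalized token embeddings; the shapes are illustrative:

```python
import numpy as np

def maxsim(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """ColBERT-style late interaction: for each query token, take its
    best-matching document token, then sum over the query tokens."""
    sims = query_tokens @ doc_tokens.T     # (n_query, n_doc) cosine sims
    return float(sims.max(axis=1).sum())   # best doc token per query token

rng = np.random.default_rng(0)
q = rng.normal(size=(5, 64));  q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(40, 64)); d /= np.linalg.norm(d, axis=1, keepdims=True)
print(maxsim(q, d))
```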
Better Greedy Tokenizers: Handling WordPiece's [UNK] Problem
In a previous post, I showed that making a tokenizer greedy, that is, always picking the longest matching subword like WordPiece does, can improve results without retraining. But WordPiece can unfortunately silently break your tokenization.
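To illustrate the failure mode, here is a toy WordPiece-style max-match with a made-up vocabulary: if any part of a word fails to match, WordPiece throws away the pieces it already found and emits `[UNK]` for the whole word:

```python
VOCAB = {"snow", "##board", "##ing"}  # made-up vocabulary for illustration

def wordpiece(word: str, unk: str = "[UNK]") -> list[str]:
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in VOCAB:
                pieces.append(piece)
                break
            end -= 1
        if end == start:  # no subword matched: discard all progress
            return [unk]
        start = end
    return pieces

print(wordpiece("snowboarding"))  # ['snow', '##board', '##ing']
print(wordpiece("snowball"))      # ['[UNK]'] -- "snow" matched, "ball" didn't
```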
Note: alternative to regex splitting in byte tokenizers
In a previous note, I discussed an alternative to setting `split` to true in a `ByteLevel` pretokenizer. I suggested using a `ByteLevel` normalizer first, and then splitting using a complicated regex in “byte space”. However, this turned out not to work very well: there are certain character classes in the original regex, such as `\s`, that are very difficult to convert to a pattern in byte space.
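To see why, consider GPT-2’s byte-to-unicode mapping, which is what the `ByteLevel` components use: whitespace bytes get shifted to code points like Ġ, which `\s` simply doesn’t match. A small sketch:

```python
import re

def bytes_to_unicode() -> dict[int, str]:
    # GPT-2's mapping: printable bytes map to themselves, the rest are
    # shifted into unused code points starting at 256.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

table = bytes_to_unicode()
byte_space = "".join(table[b] for b in "hello world".encode("utf-8"))
print(byte_space)                    # helloĠworld -- the space is now Ġ
print(re.search(r"\s", byte_space))  # None: \s matches nothing in byte space
```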
Turning any tokenizer into a greedy one
I recently re-read Greed is All You Need: An Evaluation of Tokenizer Inference Methods. In this paper, the authors show that switching out inference methods for tokenizers can improve performance on various tasks.
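As a sketch of what greedy inference looks like, here is longest-match-first tokenization over an existing vocabulary. In practice the vocabulary would come from a trained tokenizer (e.g. `tokenizer.get_vocab()`); here it is a toy set:

```python
def greedy_tokenize(text: str, vocab: set[str]) -> list[str]:
    tokens, start = [], 0
    while start < len(text):
        for end in range(len(text), start, -1):  # longest candidate first
            if text[start:end] in vocab:
                tokens.append(text[start:end])
                start = end
                break
        else:  # nothing matched: fall back to a single character
            tokens.append(text[start])
            start += 1
    return tokens

print(greedy_tokenize("unbelievable", {"un", "believ", "believable", "able"}))
# ['un', 'believable'] -- the longest match wins, no retraining needed
```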
Tokenizer decasing
In this post I will talk about something I call tokenizer *decasing*. Decasing is very similar to putting a lowercase normalizer in front of a tokenizer, but works better.
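For reference, the baseline mentioned above, a lowercase normalizer in front of an existing tokenizer, looks roughly like this with the `tokenizers` library (the model name is just an example):

```python
from tokenizers import Tokenizer
from tokenizers.normalizers import Lowercase, Sequence

tok = Tokenizer.from_pretrained("bert-base-cased")  # example model

# Prepend lowercasing to whatever normalization the tokenizer already does.
existing = tok.normalizer
tok.normalizer = Sequence([Lowercase(), existing]) if existing else Lowercase()

print(tok.encode("Hello World").tokens)
```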
Using overload to handle tagged union return types
Here’s a function with an idiom I’ve seen a lot (probably copied from `sentence-transformers`):
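The function itself isn’t reproduced in this excerpt, but the idiom typically looks like the following sketch (the names are hypothetical): a boolean flag decides the return type, and `typing.overload` plus `Literal` lets a type checker narrow it:

```python
from typing import Literal, Union, overload

import numpy as np
import torch

# Hypothetical encode-style function; the return type depends on a flag,
# much like `encode` in sentence-transformers.
@overload
def encode(text: str, convert_to_numpy: Literal[True] = ...) -> np.ndarray: ...
@overload
def encode(text: str, convert_to_numpy: Literal[False]) -> torch.Tensor: ...
def encode(text: str, convert_to_numpy: bool = True) -> Union[np.ndarray, torch.Tensor]:
    embedding = torch.randn(8)  # stand-in for a real forward pass
    return embedding.numpy() if convert_to_numpy else embedding

# Under a type checker such as mypy or pyright:
#   encode("hi")                          -> np.ndarray (not a union)
#   encode("hi", convert_to_numpy=False)  -> torch.Tensor
```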